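The previous round of processing left obvious outliers in place, so this time we drop the records whose living area (GrLivArea) is 4500 or more.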
# Keep only the rows whose living area is below 4500
train = train[train.GrLivArea < 4500]
train.reset_index(drop=True, inplace=True)
# Replace the target with log(1 + SalePrice)
train["SalePrice"] = np.log1p(train["SalePrice"])
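Because the target is now stored as log(1 + SalePrice), any prediction made on this scale has to be mapped back with the inverse function before it can be read as a price. The following is a minimal sketch on made-up toy data (the `toy` frame and its values are assumptions for illustration, not the real dataset) showing the same filter and transform, and that `np.expm1` undoes `np.log1p`:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real training set (values are made up).
toy = pd.DataFrame({
    "GrLivArea": [1500, 2000, 5000, 1200],
    "SalePrice": [200000, 250000, 160000, 130000],
})

# Same outlier filter as in the article: keep rows with GrLivArea < 4500.
toy = toy[toy.GrLivArea < 4500].reset_index(drop=True)

# Forward transform: log(1 + x).
log_price = np.log1p(toy["SalePrice"])

# Inverse transform: exp(x) - 1 recovers the original prices.
restored = np.expm1(log_price)

print(len(toy))
print(np.allclose(restored, toy["SalePrice"]))
```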
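To process the train and test features together and avoid errors at prediction time, we first pull out the training target and then merge all the data, so that train and test can be handled in one pass.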
y = train['SalePrice']
train_features = train.drop(['SalePrice'], axis=1)
test_features = test
features = pd.concat([train_features, test_features]).reset_index(drop=True)
features.shape  # check the size of the combined data
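Once the combined frame has been processed, it has to be split back into its train and test parts. One common way, sketched here on toy data, is to slice by the number of training rows, which works because the training rows come first in the concatenation. The names `toy_train`, `toy_test`, `X_train`, and `X_test` are hypothetical and do not appear in the original code:

```python
import pandas as pd

# Toy frames standing in for train and test (values are made up).
toy_train = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [12.2, 12.1, 12.3],
})
toy_test = pd.DataFrame({"Id": [4, 5], "LotArea": [11622, 14267]})

# Pull out the target, then stack the feature columns of both sets.
toy_y = toy_train["SalePrice"]
combined = pd.concat(
    [toy_train.drop(columns=["SalePrice"]), toy_test]
).reset_index(drop=True)

# ... shared preprocessing on `combined` would go here ...

# Split back: the first len(toy_y) rows are the training rows.
n_train = len(toy_y)
X_train = combined.iloc[:n_train]
X_test = combined.iloc[n_train:]

print(combined.shape, X_train.shape, X_test.shape)
```

This split is only valid as long as no rows are dropped or reordered after the concatenation.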
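As mentioned in the first article of this series, variables need the correct type before the data can be predicted or imputed correctly, so we first assign each variable its proper type.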
# Convert variables that are stored as numbers but are really categorical into strings
features['MSSubClass'] = features['MSSubClass'].astype(str)
features['YrSold'] = features['YrSold'].astype(str)
features['MoSold'] = features['MoSold'].astype(str)
# Fill selected variables with the value that suits them best
features['Functional'] = features['Functional'].fillna('Typ')
features['Electrical'] = features['Electrical'].fillna("SBrkr")
features['KitchenQual'] = features['KitchenQual'].fillna("TA")
features["PoolQC"] = features["PoolQC"].fillna("None")
# Fill selected variables with their mode (most frequent value)
features['Exterior1st'] = features['Exterior1st'].fillna(features['Exterior1st'].mode()[0])
features['Exterior2nd'] = features['Exterior2nd'].fillna(features['Exterior2nd'].mode()[0])
features['SaleType'] = features['SaleType'].fillna(features['SaleType'].mode()[0])
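The two imputation styles used above can be checked on a small example. This is a minimal sketch on made-up data (the `toy` frame is an assumption for illustration): one column is filled with its mode, the other with the literal string "None", and the remaining missing values are counted afterwards.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (contents are made up).
toy = pd.DataFrame({
    "SaleType": ["WD", "New", np.nan, "WD"],
    "PoolQC": [np.nan, "Gd", np.nan, np.nan],
})

# mode() returns a Series of the most frequent value(s); [0] takes the first.
most_common = toy["SaleType"].mode()[0]
toy["SaleType"] = toy["SaleType"].fillna(most_common)

# Fill with a fixed category instead of a statistic.
toy["PoolQC"] = toy["PoolQC"].fillna("None")

# Count what is still missing in each column.
remaining = toy.isnull().sum()
print(remaining)
```

Running the same `isnull().sum()` check on `features` after each fill step is a cheap way to confirm that the intended columns no longer contain missing values.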